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Importance sampling is a variance reduction technique for effi¬ 
cient estimation of rare-event probabilities by Monte Carlo. In stan¬ 
dard importance sampling schemes, the system is simulated using 
an a priori fixed change of measure suggested by a large deviation 
lower bound analysis. Recent work, however, has suggested that such 
schemes do not work well in many situations. In this paper we con¬ 
sider dynamic importance sampling in the setting of uniformly recur¬ 
rent Markov chains. By “dynamic” we mean that in the course of a 
single simulation, the change of measure can depend on the outcome 
of the simulation up till that time. Based on a control-theoretic ap¬ 
proach to large deviations, the existence of asymptotically optimal 
dynamic schemes is demonstrated in great generality. The implemen¬ 
tation of the dynamic schemes is carried out with the help of a limit¬ 
ing Bellman equation. Numerical examples are presented to contrast 
the dynamic and standard schemes. 

1. Introduction. Among variance reduction techniques for efficient Monte 
Carlo simulation is importance sampling, in which the data is generated 
using a probability distribution different from the true underlying distri¬ 
bution. It can be especially effective when applied to the estimation of ex¬ 
pectations that are largely determined by rare events. To demonstrate the 
difficulty involved in simulating rare events by naive Monte Carlo, we con¬ 
sider a simple example. Let X be a random variable taking values in M rf , and 
suppose we are interested in estimating p = P{X £ A} for some Borel set 
icl* 1 . To this end, a sequence of independent and identically distributed 


Received June 2003; revised November 2003. 

Supported in part by the NSF Grants DMS-00-72004, ECS-99-79250 and Army Re¬ 
search Office Grants DAAD19-00-1-0549 and DAAD19-02-1-0425. 

2 Supported in part by NSF Grant DMS-01-03669. 

AMS 2000 subject classifications. 60F10, 65C05, 93E20. 

Key words and phrases. Asymptotic optimality, importance sampling, Markov chain, 
Monte Carlo simulation, rare events, stochastic game, weak convergence. 

This is an electronic reprint of the original article published by the 
Institute of Mathematical Statistics in The Annals of Applied Probability , 

2005, Vol. 15, No. 1A, 1-38. This reprint differs from the original in pagination 
and typographic detail. 


1 




2 


P. DUPUIS AND H. WANG 


(i.i.d.) copies Xo,Xi,... of X are generated. With Ik = 1 rx k eA}i an unbi¬ 
ased estimate for p based on the first K samples is just the sample mean: 
Qk = (lo + h + • —I - Ik~i)/K. The relative error associated with this esti¬ 
mator is 

, . . standard deviation of Qk Vp — p 2 1 

relative error =-——-=-• ,_ _ 

mean of Qk p \JK 

Since Vp — p 2 /p —» oo as p tends 0, a large sample size K is required for 
the estimator Qk to achieve a reasonable relative error bound. For example, 
if p = 10~ 8 , ten billion samples are required to achieve a relative error bound 
of 10%. 

The basic idea of importance sampling is as follows. Suppose that X has 
distribution 0, and consider an alternative sampling distribution r. It is re¬ 
quired that 0 be absolutely continuous with respect to r, so that the Radon- 
Nikodym derivative f(x ) = (d6/dr)(x) exists. Independent and identically 
distributed samples Xo,Xi,... with distribution r are generated. Form the 
estimate 

1 K ~ 1 

Qk = jt ^2 /P0c)l{x fc eA} 
k =o 

in lieu of Qk- It is easy to check that Qk is an unbiased estimate of p, with 
a rate of convergence determined by 

var [f{X 0 )t { x o£A} ] = [ t {xeA} f{x)0(dx) - p 2 . 

*/]R 

The optimization of this quantity over all possible r is inappropriate. In¬ 
deed, taking f(x) = p^ 1 '^{ X £A} (i-e., r is the conditional distribution of X 
given JeA), the variance becomes 0, but this change of measure requires 
the knowledge of the unknown parameter p. Instead, one typically seeks to 
minimize over parameterized families of alternative sampling distributions. 

When the distribution of X is connected to a large deviations problem, 
a standard heuristic is that the change of measure used to prove the large de¬ 
viation lower bound should be a good (perhaps nearly optimal) distribution 
to use for the purposes of importance sampling. The first result of this type 
was given by Siegmund [34]. The basic idea was subsequently investigated in 
many contexts, and a small selection of the literally hundreds of papers on 
the topics is [[1, 2, 3, 7, 8, 9], [11, 12, 15, 17, 18, 20, 24, 25, 29, 30, 33]]. Nec¬ 
essary and sufficient conditions under which a prescribed scheme is asymp¬ 
totically optimal are discussed in [10, 31, 32], while [21] gives a survey of 
rare-event simulation. 

The validity of the heuristic, however, was challenged in [19]. Counterex¬ 
amples were constructed to show that, under some very common settings, 
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the change of measure suggested by large deviations leads to importance 
sampling schemes with very poor properties. 

In order to explain these counterexamples, and more importantly, to find 
asymptotically optimal importance sampling algorithms in great general¬ 
ity, [16] introduces a dynamic importance sampling scheme and shows its 
asymptotic optimality in the setup of i.i.d. random variables (Cramer’s the¬ 
orem). The key observation is that many changes of measure are suggested 
by the large deviation lower bound analysis, and one must consider this 
larger class if one hopes to identify importance sampling schemes that work 
well in general. This leads to the development of schemes where the sam¬ 
pling distribution is dynamic (or, “adaptive”) in the sense that the change 
of measure in the course of a single simulation can depend on the outcome 
of the simulation up till that time. For this reason, we also call such schemes 
adaptive importance sampling schemes. 

The present paper analyzes the estimation of rare-event probabilities as¬ 
sociated with uniformly recurrent Markov chains. More precisely, let {Yj,j E 
No} be a uniformly recurrent Markov chain taking values in a Polish space 
S, and let g: S —> W l be a bounded measurable function. Define S n = g(Yo) + 

g(Y \) -|- bg(Tn_i). The probability of interest is P{S n /n E A} for a Borel 

set A C W l and n large. An asymptotic optimality result for traditional 
importance sampling is available in the one-dimensional case, d= 1, under 
the assumption which implies that the set A is within a half interval that 
does not contain the expectation of g under the invariant distribution [9]. 
A “dissection” approach was introduced for the high-dimensional case [9]. 
This approach was later on applied to Markov additive sequences [11], and 
was also implicitly used in [19]. This dissection approach requires that one 
appropriately partition the set A into a finite number of subsets, and that a 
(possibly different) change of measure be applied to efficiently estimate the 
probability of each individual subset. However, there is no constructive way 
to obtain a suitable partition in general. 

In this paper we develop adaptive importance sampling schemes for uni¬ 
formly recurrent Markov chains. The existence of asymptotically optimal 
adaptive schemes is demonstrated for arbitrary dimension d, under very mild 
conditions on the set A. It turns out that one must study the asymptotics 
of a small noise stochastic game in order to analyze the optimality of im¬ 
portance sampling schemes. The distinction between the change of measures 
used in traditional importance sampling and adaptive importance sampling 
amounts, in control terminology, to the difference between “open-loop” and 
“feedback” controls. However, open loop controls are usually not optimal in 
the setting of stochastic games, except for very special cases. For this rea¬ 
son, the traditional importance sampling will not be asymptotically optimal 
in general. Our analysis indicates that the adaptive scheme also works for 
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estimating functionals (other than probabilities) largely determined by rare 
events. 

The paper is organized as follows. The setting of the problem is intro¬ 
duced in Section 2, with a brief description of the large deviations princi¬ 
ple for uniformly recurrent Markov chains. We also give the definition of 
asymptotic optimality in this section. In Section 3 we show that adaptive 
importance sampling schemes designed to minimize the second moment are 
asymptotically optimal. Section 4 discusses an alternative formal PDE ap¬ 
proach to the adaptive scheme, and describes a method for the construction 
of an asymptotically optimal adaptive scheme that does not directly depend 
on the large deviation parameter n. Numerical examples are presented in 
Section 4.3. Certain technical proofs are deferred to the appendices to ease 
exposition. 

2. Problem setup and background. 

2.1. Problem setup. Let Y = {Yj,j E No} be a time-homogeneous Markov 
chain taking values in a Polish space S, with transition probability kernel 

P(x,dy) = P{Y j+ 1 edy\Yj = x}. 

Let g:S — be a bounded Borel-measurable function, and define 
S n = g(Y 0 )+g(Y 1 ) + --- + g(Y n _ 1 ). 

For an arbitrary Borel set A C M d , we wish to estimate 

p n = P{S n /n E A}. 

Throughout the paper we will make use of the following uniform recurrency 
assumption. 

Condition 2.1. There exists a probability measure v p on S, an integer 
mo E N and a pair of strictly positive real numbers a, b such that 

av p (B) <p( m °\x,B ) < bu p (B) 

for all x E S and Borel sets B. Here denotes the m-step probability 
transition kernel. 

For example, an irreducible Markov chain with a finite state space is 
always uniformly recurrent. 

The large deviation principle for a uniformly recurrent Markov chain is 
well known. It asserts that {S n /n} satisfies the large deviation principle with 
a convex rate function L : —> [0, oo]. The identification of L is deferred to 
the next section. We will impose the following assumption throughout the 
paper. 
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Condition 2.2. The Borel set vlcK^ satisfies the condition 

inf L((3) = inf L(/3). 

f}&A v /3eA° 

Under Conditions 2.1 and 2.2, we have the large deviations approximation 
lim —log P{S n /n £ A} = — inf L(/3). 

n—>oc n /3eA 

Remark 2.1. The uniform recurrency assumption (Condition 2.1) is 
convenient to work with. It includes the important case of irreducible fi¬ 
nite state Markov chains, and generalizes the results in [16] where i.i.d. 
sequences were considered. However, this strong recurrency assumption also 
excludes many important Markov chains. One difficulty in extending the 
present results to more general Markov chains is that the uniform positivity 
and boundedness of the eigenfunctions (see Section 2.2) may not be pre¬ 
served [26, 27]. It is clear that generalization in this direction will require a 
much more involved analysis. 

2.2. LDP for a uniformly recurrent Markov chain. In this section we 
discuss two different approaches to the identification of the rate function L. 
The first approach suggests a parameterized family of change of measures 
(see Remark 2.2) that will be used later on to build importance sampling 
schemes. The second approach identifies the rate function L in terms of 
relative entropy, and will be used in the analysis of the asymptotic optimality 
of adaptive schemes. 

The first approach is based on a generalized Perron-Frobenius theorem. 
Fix any a £ W l . Then by [22], the nonnegative kernel 

exp {(a,g(y))}p(x,dy) 

admits a unique real eigenvalue exp{P7(a)} and a unique (up to a multi¬ 
plicative constant) eigenfunction r(x;a) in the sense that, for every x £ S, 

(2.1) [ e l ' a ’ 9 ^r(y,a)p(x,dy) = e H< -“V(x; a), 

Js 

and with the following properties. H(a ) is an analytic, strictly convex func¬ 
tion of a £ W l with H( 0) = 0, and there exist 0 < c a < C a < oo such that 

(2.2) c a < r(x\a) < C a \/x£S. 

The paper [22] also shows that the rate function of the large deviation prin¬ 
ciple for { S n /n} is the convex conjugate of H, that is, 

(2.3) L(/3) = sup l(a,/3) — H(a)]. 

a£ s. d 
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Note that in the special case when the Markov chain Y is an i.i.d. se¬ 
quence, H(a) is the logarithm moment generating function of g{Yj) and 
r(x;a) = 1. Therefore, this result generalizes the classical Cramer’s theo¬ 
rem, at least for bounded i.i.d. random variables. For the case when Y is an 
irreducible Markov chain with finite state space, exp{H (ct)} is just the max¬ 
imal eigenvalue of the irreducible nonnegative matrix exp{(a, g(y))}p(x, dy), 
and r(-\a) is the associated right eigenvector. 

Remark 2.2. It is not difficult to see that, thanks to (2.1), for each 
a 

exp {(a,g(y)) - H(a)} ■ r ^ y, °^ -p(x,dy) 

r(x; a) 

defines a probability transition kernel. 


Another approach is the weak convergence methodology which utilizes 
a stochastic control representation for certain exponential integrals [14]. It 
first identifies the large deviations rate function for the empirical measure 
of the Markov chain in the r-topology, then uses contraction principle to 
obtain the rate function for {S n /n}. We will need the following definitions. 

For an arbitrary Polish space Z, we denote by V(Z) the collection of all 
probability measures on space (Z,B(Z)). For a pair of probability measures 
7 ,/i€ V(Z), the relative entropy of 7 with respect to p is defined as 


{ f d'y 

L ioe i di ’ if7<<ft 

00 , otherwise. 

Given a probability transition kernel q(x, dy) on space Z, we define pq E 
V(Z), p® q E V(Z x Z) by 

pq{B) = Jz q(x, B) p(dx), 

( pZq)(DxB)= / p(dx)q(x,dy) = / q(x,B)p(dx) 

JdxB Jd 


for all Borel sets D, B C Z. The collection of all probability transition kernels 
on Z is denoted by T{Z). 

The weak convergence approach identifies the rate function for {S n /n} in 
terms of relative entropy: 


L(/3) = inf q\\p®p):p E V(S), 


q E T(S ), pq = p, gdp = /?|. 


(2.4) 
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The validity of the representation (2.4) is implied by the results in [14], 
Chapters 8 and 9, where the large deviation principle of the empirical mea¬ 
sures associated with Markov chains are studied under weaker assumptions. 

For future reference, we summarize the preceding discussion into the fol¬ 
lowing proposition. The only part that has not been mentioned is the super¬ 
linearity of the rate function L, which is an easy consequence of (2.3) and 
the finiteness of H ([14], Lemma 6.2.3(c)). 


Proposition 2.1. Under Condition 2.1, the sequence {S n /n} satis¬ 
fies the large deviation principle with rate function L, which is given by 
(2.3) and (2.4). Moreover, the rate function L is convex, lower-semicontinuous 
and superlinear in the sense that 


lim inf 

IV—>oo {/3gR d : |/3||>jV} 


m 

m 


= oo. 


In particular, L has compact level sets. 


2.3. Asymptotic optimality. In this section we define asymptotic opti¬ 
mality for an importance sampling scheme. 

Consider a probability space and a family of events {A n } with 


lim log P{A n } = '■y, 

n—>oo ti 

for some 7 > 0. A general formulation of importance sampling for this prob¬ 
lem can be described as follows. In order to estimate P{A n }, a generic 
random variable Z n is constructed such that P{A n } = EZ n . Independent 
replications {Z^,Z\, ..., Z^ -1 ) of Z n are then generated, and we obtain an 
estimator by averaging 


r\K z n + z l. + ' 


■ + z« 


-1 


K 


The estimator is unbiased, that is, EQ% = P{A n }. The rate of convergence 
associated with this estimator is determined by the variance of the sum¬ 
mands, or equivalently, their second moment E[(Z n ) 2 ]. The smaller the sec¬ 
ond moment, the faster the convergence, whence the smaller sample size K 
required. However, it follows from Jensen’s inequality that 

1 - 1 

lim sup-log E[(Z n ) 2 ] < lim-log (EZ n ) 2 = 2y. 

n —kx) Tl n >0 ° Ti 


The estimator Q^ is said to be asymptotically optimal if 

lim --logE[(Z n ) 2 } = 27 . 

n —>00 77 , 
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Remark 2.3. Since the performance of the estimator is completely 
determined by the second moment of its generic, i.i.d. building block Z%, we 
will drop the superscript k hereafter. Note that n does not stand for sample 
size, but for the large deviation parameter. 


3. Statement of the main result. The adaptive importance sampling 
scheme we consider dynamically selects the change of measure (or the pa¬ 
rameter a) in the form suggested by Remark 2.2, according to the sample 
history. Naturally, the scheme is closely related to a control problem. Let the 
control a n = {a”(-, •), j = 1,... , n — 1} be given, where each a™ : S x —> M d 
is a Borel-measurable function. Then the state dynamics are governed by 

3 - 1 

= J = 0,l,...,n. 

i =0 


Here we set Y 0 = Y 0 = y 0 , and for j > 1, YJ 1 is conditionally distributed, 
given {Tj n , i = 0,1,..., j — 1}, according to 


Vj(dy) = exp {(a? 


g{y)) — H{oij)} 


r(y\oft) 


■ p(Yj l _ 1 ,dy) 


3 3 3 >J r(YJ l _ 1 ;a'j) ■ 7 " 

with (abusing notation a bit) a" = otj(Yj l _ 1 ,S‘j/n). 

An unbiased estimator of P{S n /n G A} is defined as the average of inde¬ 
pendent copies of 


Xn — l{S£/n£A} ex P 


n —1 

£(-(«?.«(*?)>+ »(<*?)) 

3 =1 


n— 1 


n 


r(YJL i;«7) 
r(Yf-,a]) ' 


Our goal is to minimize the second moment, hence the variance, of the 
summands X n by judiciously choosing the control a n . Thus, we consider the 
value function defined by 


V n (y 0 ) = mfE[X 2 n ] 


( n —1 

£(-2<a?,s(i7))+2ff(a?)) 
j=i 

W r 2 (y/Li;ap ' 

X f , ^{YJ 1 ] a™) 

For convenience we write V n (yo) as V n when no confusion is incurred. We 
also consider the log transform 

W n = --logV n . 
n 

We have the following result, which asserts the existence of asymptotically 
optimal adaptive importance sampling schemes. 


= inf E 

rv n 
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Theorem 3.1. Under Conditions 2.1 and 2.2, we have 

lim W n = 2 inf L(B). 

n—>oo /3eA 

The detailed proof is deferred to Appendix A. It is worth pointing out 
that the construction of asymptotically optimal or nearly optimal adap¬ 
tive schemes (i.e., selection of the control a n ) is implied by a dynamic pro¬ 
gramming equation (DPE) appearing in the proof. Since the proof is rather 
lengthy and technical, it makes sense to give an outline and some intuitive 
discussion below, so that readers can proceed to the construction of the 
adaptive schemes (Section 4), without having to delve into the technical 
details of the proof. 


Outline and intuition of the proof. Thanks to the discussion in Sec¬ 
tion 2.3, it suffices to show the lower bound 


(3.1) 


lim inf W n > 2 inf L(d). 

n fSeA 


The proof will utilize the DPE that is satisfied by W n . In order to do so, 
we first extend the dynamics. Abusing notation a bit, for x £ M rf , y £ S and 
i £ {0,1,..., to}, define the dynamics 


3- 1 _ 

S^ = nx + J2Ki, 

i—i 


j = i,...,n. 


Here we set T;,i = y, and for j > i + 1, Yff is conditionally distributed, given 
{y^, £ = i,..., j - 1}, according to 


V IM Z ) = exp{(o:!?,y(z)) - H(af)} 


r(z\a 7 j) 


P( Y U-i,dz), 


where a” = a r f (Yff _ x , S-'^ / n). The original control problem corresponds to 
a; = 0, i = 0,y = yo- Define analogously 


V n {x,y,i) 

= inf E 


n —1 


t { s^j neA } ex p \ + 2 #(«i)) 


n—1 2 (yri .„nV 

x r 


j=i +1 


r 2 (TA;a^) 


and its log transform 


W n (x,y,i) = --log V n (x,y;i). 
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The terminal conditions are 


V n (x,y;n) = 1 a(x), W n (x,y 1 n) = oo • 1a°(x). 


Since it is inconvenient to study a problem with an oo terminal condition, we 
instead work with a mollified version of the control problem. Let F: —> R 
be an arbitrary bounded and Lipschitz continuous function. Suppose that 
Vp is defined as V n , save that the indicator function 1 ^™ / n£j \] I s replaced 

by exp{— 2nF(S™ n /n)}. Similarly define 

(3.2) W$(x,y;i) = --log VP(x,y,i). 

n 

Since Vp is the value function of a control problem, one can write down 
the DPE for Vp. Substituting (3.2) in this DPE, one obtains an equation 
for Wp\ see (A.l). The proof of the desired inequality (3.1) is based on the 
analysis of this recursive equation for Wp. 

The relative entropy representation for exponential integrals ([14], Propo¬ 
sition 1.4.2) states that 

-R(tIIm) + J fd'y 

for all bounded and Borel measurable functions /. Applying this represen¬ 
tation formula to the equation (A.l) for Wp, one obtains 


(3.3) — log / e K x ^n(dx) = inf 

Js 76 V(S) 


Wp(x, y; i) 


(3.4) 


= sup inf 

a£R d 'Y£'P(S) 


J W£(x + ^g(y),z;i + lJj(dz) 

+ ^ ^( 7(01 \p(y, •)) + J (a,g{z))-/(dz) - H(a ) 


1 f r(z;a) , 
+ - / log -x 7 (dz) 

nj r(y,a) 


This equation suggests that Wp is the lower value of a discrete-time stochas¬ 
tic game. One of the two players of the game (the a-player) selects the pa¬ 
rameter a, and is the weaker player. The other player (the 7 -player) is the 
stronger player, and selects the distribution 7 that determines the evolution 
of the state. The right-hand side of (3.4) would take a simpler form if we 
could permute the sup and inf. However, this is not (in general) possible, 
since the last term 


(3.5) 



r(z; a) 
r(y;a) 


7 (dz) 


may not be concave in a. 
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This difficulty is also the main distinction from the setting of Cramer’s 
theorem where the Markov chain Y reduces to an i.i.d. sequence of random 
variables. The latter case gives r(x;a) = 1 and the unpleasant term (3.5) 
disappears, whence the min/max theorem can be applied to convert the 
DPE of Wp into a DPE associated with a control problem, which is much 
simpler to analyze than a game [16]. However, the interchange of sup and 
inf is not possible with (3.4) as written. 

The key idea to overcome this difficulty and to obtain a lower bound for 
Wp is as follows. Fix an integer m, and consider a variant of the game where 
the a-player is constrained to policies such that a must be constant over 
time intervals of length 1/m. This new game is even more favorable to the 
7 -player, whence it will have a smaller lower-value. Letting n go to infinity, 
the lower value of the new game converges to a function Up 1 , and we expect 

liminf Wp(x,y; [nk/m \) > lfp(x; k ), k = 0,1,... ,m. 


A bonus of taking the limit is that the troubling terms (3.5), which can be 
interpreted as part of the running cost, cancel off, and it is not difficult to 
guess that U"p should satisfy 


(3.6) 


Up‘(x; k) = sup inf 

aeR d ^ eRd - 


Uf[x + —fok + 1 
m 


H- (L({3) + (a, (3) — H(a )) 

m 


with terminal condition 


(3.7) Uf{x\m) = 2F{x). 

In the proof, Up is in fact defined recursively through equations (3.6) and (3.7). 

Equation (3.6) is much easier to analyze. Analogously to [16], one can 
show by a weak convergence argument that 

(3.8) liminf Upix, 0) > 2 inf {L(@) + Fix + /?)}, 

m-KX> /3eR d 

which in turn implies 

liminf Wp (x,y;0) > 2 inf {L(/3) + F(x + /?)}. 

n—> °° /3e R d 

Letting x = 0 and the modifier F tend to oo • 1 a c , one arrives at the desired 
inequality (3.1). □ 


The following result is useful in the identification of an optimal adaptive 
importance sampling scheme in Section 4. 
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Corollary 3.2. Fix an arbitrary x G M d , and a bounded Lipschitz con¬ 
tinuous function F:M. d — »M. Assume Condition 2.1, and define Up recur¬ 
sively by (3.6) with the terminal condition (3.7). Then 


lim U'p(x-,\tm\) = 2UF(x,t) VtG [0,1], 

m —>oo 


where 


(3.9) U f{.x, t) = inf {(1 - t)L(fi) + F(x + (1 - t){3)}. 


Proof. We will show the equality for t = 0. The case with general t G 
[0,1] is similar and thus omitted. 

Thanks to (3.8), it suffices to prove 


limsup?7p'(x;0) < 2Uf(x,0) = 2 inf {L(/3) + F(x + /?)}. 

m —>oo 0eM. d 

Fix an arbitrary (3 G W l . The recursive definition of U 7 p (3.6) and (2.3) yield 




Repeatedly applying this inequality for k = 0,1,..., m — 1, we arrive at 


Uf?(x; 0) < UP(x + + 2L(fi) = 2F(x + f3) + 2L((3) 


thanks to (3.7). This completes the proof. □ 

4. Implementation issues and examples. 

4.1. The limit control problem and implementation issues. Theorem 3.1 
establishes the existence of asymptotically optimal adaptive sampling schemes. 
However, it does not explicitly discuss the construction of such schemes. On 
the other hand, the proof of the theorem implies that one approach of con¬ 
struction would be to solve, numerically if need be, the DPE (3.4) associated 
with Wfi (W n equals Wfi when F = oo ■ 1a c )- However, this approach may 
not only require a lot of computation effort, but the resulting adaptive sam¬ 
pling control (i.e., control a n ) will directly depend on n. In general, one 
would prefer schemes without this dependence. 

An alternative approach is to consider the DPE associated with the limit 
problem of Up as m tends to infinity. To this end, we rewrite (3.6) as 


0 = sup inf Ah™ + — (L(/3) + (a,/3) — H(a)) , 


aSK d L m 
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where 

AUp = Up(x + —/3; fc + 1^) - Up(x- k). 

\ m ) 

Suppose that a t subscript denotes the partial derivative with respect to t, 
and that an x subscript denotes the vector of partials with respect to Xi, 
i = 1 ,d. Since Corollary 3.2 (for F bounded and Lipschitz continuous) 
asserts that 


lim U^ix]\tm\) = 2UF(x',t), 

m —>oo 

we have formally the approximation 

A UP « / -(3, (2 U F ) X ) + -(2 U F ) t . 

\m / m 

Substituting this back, we have 

0 = sup inf [((3, (2 U F ) X ) + (2 U F )t + L((3 ) + (a, (3) - H (a)] 
aGR d/3€M d 

= (2 U F ) t + sup inf \L((3) + (a + (2U F ) X , (3) - H(a)\. 
aeR d/3eK d 

Representing the infimum in terms of the Legendre transform H of L gives 
0 = (2 U F ) t + sup [-H(-a - (2 U F ) X ) - H{a)}. 

a£R d 


The strict convexity of H implies that 


(4.1) a*(x,t) = -(U F ) x (x,t), 
and that 

(4.2) 0 = (U F ) t -H(-(U F ) x ). 


Equation (4.1) identifies, at least formally, an optimal feedback control 
policy. However, this observation is not entirely satisfactory since Uf does 
not usually have an explicit solution, and even if there is an exact formula 
for Uf, the partial derivatives may not be defined for all time and spatial 
points. In order to obtain a formal characterization of a* that is more useful, 
we observe that, thanks to the definition (3.9) of Uf and the convexity of L, 
Uf is the value function of the deterministic control problem 


Uf(x, t ) = inf 


L(<]>(s))ds + F((j>{ 1)) 


where the infimum is over all absolutely continuous cj> which satisfy (f>(t) = x. 
It is straightforward to see from this control problem that an optimal control 
at (x,t) is the minimizer in (3.9), say (3*(x,t), thanks to the convexity of L. 
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The standard dynamic programming argument implies that Up (in a weak 
sense) satisfies the DPE 

0 = (U F ) t + inf [L{a) + (a, (U F ) X )] = (U F ) t ~ H{-(U F ) X ), 
aG s. d 

which, not surprisingly, is just equation (4.2). The optimal control /3*(x,t) 
is, at least formally, the minimizer in the DPE, or /3*(x,t ) and — (U F ) x (x,t) 
are conjugate. It follows that 

a*(x,t ) is conjugate to the minimizer /3*(x,t) in (3.9). 

At points where (U F ) x (x, t ) exists this definition gives a*(x, t ) = —(U F ) x (x, t). 

At points where (U F ) x (x,t) does not exist there are multiple minimiz¬ 
ing p*(x,t), and one should define a*(x,t) through conjugacy in any Borel 
measurable way. 

Remark 4.1. The original (unmollified) problem corresponds to F=oo-1a c - 
In this case, 

(4.3) /3*(x, t ) G argmin{(l — t)L(@ ): x + (1 — t)(5 G A}, 

and a*(x,t ) is its conjugate. 

4.2. Numerical examples. We give two numerical examples in order to 
illustrate the asymptotic optimality of the adaptive schemes, in general, 
and the pitfalls of the traditional importance sampling schemes. The first 
example is concerned with a simple Markov chain with two states, while the 
second example studies a discrete time Markov chain embedded in a tandem 
Jackson network with finite buffers. 


Example 4.1. Consider a simple finite-state Markov chain Y with state 
space S = {1, —1} and probability transition matrix 


Q 


p i -p 

1 o 


[Q(*>j)]2x2, 


for some constant p € (0,1). Define g : S —> R by g(x) = x, and S n = g(Yo) + 
9(Yi) + --- + g{Y„- 1 ) = Y 0 + Y 1 +■■■ + ¥„-!. 

Since Y is an irreducible finite-state Markov chain, the eigenvalue e H 
and eigenfunction r(-;a), as defined in (2.1) for a G M, are just the maxi¬ 
mal eigenvalue of the kernel [e“ J Q(i, j)] and the corresponding eigenvector, 
respectively. Simple algebra gives 


H(a) = log 


pe a + Vp 2 e 2a + 4(1 —p) 


2 


VaGK, 
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which is a convex function with 0) = 0, and an eigenvector 


' r(l; a) ' 


- e H(a)- 

r(-l:a) 


e a 


Therefore, for any given agl, the corresponding change of measure is 
represented by the probability transition matrix 


Qa — 


(4.4) 




r(i ; a) 


pe “ (1 — p)e 


a OL — i 

1 


0 


Let L be the convex conjugate of H. It is not difficult to check that L(/3) = oo 
if (3 < 0 or (3 > 1, and that for /3 G (0,1), 

L(f3 ) = sup[a/3 — H(a)] 

a£ R 


= ^ log 4(1 ^ 
2 g p 2 (l^/I 2 ) 


- log -—^ — - log(l — p) 

2 1 + /3 2 7 


with the minimizer 

(4.5) 

and 


a* = a* {(3) 


llog 4 ^-^ 2 
2 g p 2 (l-/3 2 ) 


L(0) = Ikn L(fJ) = -ilog(l -p), 


L(l) = limL(/3) = — logp. 


Furthermore, L(/3) = 0 if and only if (3 = ^'(0) =p/(2 — p). 

Assume Yo = 1. We are interested in estimating = P{S n /n G A} for 
the Borel set 


A = (—oo, a] U [6, oo), 0 < a < H'{ 0) < 6 < 1. 

In all the following discussion we take p = 1/2, a = 1/6, b = 1/2, which 
implies 

(4.6) hA L{(3) = L{b) < L(a). 

We will compare the naive Monte Carlo simulation, traditional importance 
sampling and adaptive importance sampling schemes below. 

The naive Monte Carlo simulation will simulate the Markov chain under 
the original transition probability kernel Q. One can also regard this as a 
special change of measure with the corresponding a = 0. In this case, the 
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estimate is just the sample mean of K i.i.d. replications of X n = 1 {s n /n&A}- 
Since the second moment of X n satisfies 

lim --log E[(X n ) 2 }= lim-log p n 

n —kx) ji n—>oo 77, 

= inf L(j3) 

/3gA v 

= L(b)<2L{b), 

the naive Monte Carlo sampling is not asymptotically optimal. 

Thanks to (4.6), the traditional importance sampling will take /3* = b, and 
a* is then defined by (4.5). The algorithm will generate a Markov chain Y 
with probability transition matrix Q a * and Yq = 1. Let 

S n = Lo + • ■ • + Lr— i- 


The estimate is the sample mean of K i.i.d. replications of 


X n = 


{Sn/n&A} 


n— 1 

II< 


0 -a*Y j+ H(a*) 


tt r(Yj-i-,a*) 
/=i r{Yj-a*) 


~ 1 { S n /neA} e 


—a*Sn+nH(a*) . c a*Y 0 -H(a*) r (^b; Q '*) 

r(Y n -i) a*) 


Since r(-;ct*) is clearly bounded from above and bounded away from zero, 
it is not difficult to see that 


1- - 1 


lim-log E[(X n ) 2 }= lim-logL^l^ /neA} e 

n^oc ji n^oo n 


—2 n(a* Sn/n—H(a* 


n — kx) 77, 


»]■ 


Simple computation yields that {S n /n} satisfies the large deviation principle 
with rate function L(/3) = L(/3) + H(a*) — a*/3. Now one can apply the 
Varadhan’s theorem ([14], Theorem 1.3.4) (with slight modification) to show 

lim --logM(X n ) 2 ] = inf [2a*f3-2H(a*) + L({ 3 )\ 

n—>oc n /3eA 


= mf [a*P - H(a*) + L(/3)]. 

In the configuration of this example, the infimum in the right-hand side is 
achieved at (3 — a, and 

lim -- log E[(X n ) 2 } =aot* -H(a*) + L(a) <2L(b). 

n—>oo 77, 

Therefore, the traditional importance sampling scheme is not asymptotically 
optimal either. 
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Table 1 

Naive Monte Carlo scheme 



No. 1 

No. 2 

No. 3 

No. 4 

Estimate p n (%) 

3.11 

3.20 

3.23 

3.09 

Standard error (%) 

0.17 

0.18 

0.18 

0.17 

95% confidence interval (%) 

[2.76, 3.46] 

[2.85, 3.55] 

[2.88, 3.58] 

[2.74, 3.44] 


In Section 3 we argued the existence of asymptotically optimal adaptive 
importance sampling schemes in general. The construction of such adaptive 
schemes involved the selection of a nearly optimal control a n = {a ™(•, •): j = 
0, 1,..., n — 1}. It was formally suggested in Section 4.1 that a good choice is 
to sample YJ 1 , conditional on {Y™, i = 0,..., j — 1}, according to the transi¬ 
tion probability matrix Q a as in (4.4) with a being the conjugate of f3*(x,t) 
given in (4.3), where x = S™/n = (Y 0 n + • • • + Yp_P)/n and t = 1 — j/n. In 
case the conjugate of (3*(x,t) is oo or — oo, a is taken as a large positive 
or negative number; see Remark 4.3 for more details. The estimate is the 
sample mean of K i.i.d. replications of 


x n — t{gri/n£A} ex P 


n —1 

3 =1 


n —1 

n 


3= 1 


r(Yp_ ,;«!?) 

r{Yp-ap ' 


The numerical results show that the controls constructed in this way have 
asymptotically optimal performance (Table 6). 

The numerical results are reported in Tables 1-3 for n = 60. The theoret¬ 
ical value of p n is 


p n = P{S n /n < a} + P{S n /n > b} 

= 0.83% + 2.44% = 3.27%. 

See Remark 4.2 for the computation of this theoretical value. For each table, 
we run four simulations each with sample size K = 10,000. 


Table 2 

Traditional importance sampling scheme 



No. 1 

No. 2 

No. 3 

No. 4 

Estimate p n (%) 

2.41 

2.48 

2.44 

16.71 

Standard error (%) 

0.04 

0.04 

0.04 

14.22 

95% confidence interval (%) 

[2.34, 2.48] 

[2.41, 2.56] 

[2.37, 2.51] 

[-11.73, 45.15] 
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Table 3 

Adaptive importance sampling scheme 



No. 1 

No. 2 

No. 3 

No. 4 

Estimate p n (%) 

3.17 

3.21 

3.35 

3.33 

Standard error (%) 

0.15 

0.13 

0.17 

0.18 

95% confidence interval (%) 

[2.86, 3.47] 

[2.85, 3.47] 

[3.00, 3.69] 

[2.96, 3.70] 


An interesting observation is that the traditional importance sampling 
scheme exhibits seemingly bizarre and inconsistent simulation results (Ta¬ 
ble 2). Similar phenomenon also occurs in the setting of Cramer’s theorem, 
that is, where the Markov chain Y reduces to a sequence of i.i.d. random 
variables; see [16, 19]. The explanation is also very similar. Under the al¬ 
ternative sampling distribution Q a *, most of the sample means S n /n will 
end up near the point b. However, a few samples (“rogue” trajectories) have 
means that fall into the interval (—oo,a]. Even though the “rogue” trajec¬ 
tories are rare, the Radon-Nikodym derivatives associated with them are so 
large that they dominate the variance. In simulation No. 4, the presence of 
a single “rogue” trajectory greatly raises the standard error associated with 
the estimate. Indeed, the proportion of the contribution to the second mo¬ 
ment from this single “rogue” trajectories is more than 99%. In simulations 
No. 1, No. 2 and No. 3, however, there are no “rogue” trajectories, and the 
standard error associated with the estimate is deceptively small. The rea¬ 
son is that the standard error is itself estimated from the sample. Without 
“rogue” trajectories, we actually underestimate the standard error. There¬ 
fore, we cannot put much confidence in the standard errors thus obtained 
or in the “tight” confidence intervals that follow. Indeed, the confidence 
intervals from these three simulations do not contain the true value. 

In contrast, the adaptive importance sampling, on the other hand, yields 
more accurate estimates and its performance is much more stable. Even 
though it does not show great advantage over naive Monte Carlo simulation 
for n = 60, it quickly does so when n gets larger. The numerical results for 
different n (with K = 10,000 fixed as before) are reported in Tables 4-6. 

The naive Monte Carlo does not work well for bigger n. For n = 120 and 
n = 180, it yields estimates with large standard errors, and for n = 240, the 
simulation yields an estimate 0, that is, no sample mean reaches the target 
set A. As for the traditional importance sampling, each simulation gives a 
very “tight” confidence interval, due to the absence of “rogue” trajectories. 
However, as discussed before, we cannot put much belief into these estimates. 
Indeed, none of these confidence intervals cover the true value of p n . 

On the other hand, the adaptive importance scheme yields much more 
accurate estimates. In Table 6, the variable V n denotes the sample estimate 
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of the second moment E[(X n ) 2 ]. Observe that as n gets larger, the ratio 

-(l/n)log£[(X n ) 2 ] _ -log E[(X n ) 2 ] _ -\ogV n 
-(l/n)logp n -log Pn -log P n 

approaches 2. In other words, the adaptive importance sampling scheme is 
approaching optimality. 

Remark 4.2. The theoretical value of p n can be computed as follows. 
Let X n be the number of — l’s in a trajectory, that is, 

n— 1 

X n = Yl 

j =0 

Since Yq = 1 and Q(— 1,1) = 1, we have 0 < X n < n/2 with probability one. 
Clearly, 

— "R 2jX n , 

whence it suffices to compute P(X n > m) for all nonnegative integers m 
such that 2m < n. But if we define Tj = inf{j > 0 - Yj = —1}, then T\ > 1 
and Tj — 1 is geometrically distributed with parameter (1 —p). Moreover, 
lr 1 = — 1, and lj+Ti = 1- Now recursively define for i > 2, T % = inf{j > 
1 4- Tj_i: Yj = —1}. Then {T\ — 1, T 2 — T\ — 2, T 3 — T 2 — 2, ... } is clearly a 
sequence of i.i.d. geometrically distributed random variables with parameter 
(1 — p), and 

P(X n >m) = P(T m < n) 

= P(T m — 2m + 1 < n — 2m + 1). 

But 

T m — 2m + 1 = (Ti — 1) + (T 2 — Tj — 2) + • • • + (T m — T m _ 1 — 2) 

is the sum of i.i.d. geometrically distributed random variables, whence has 
a negative binomial distribution with parameter m and (1 —p). Standard 
softwares such as SPLUS contain the cumulative distribution functions of 
negative binomial distributions, and can easily yield the desired probabili¬ 
ties. 


Table 4 

Naive Monte Carlo simulation 



n = 120 

n = 180 

n = 240 

Theoretical p n 

1.61 x 10 -3 

9.66 x 10~ 5 

6.35 x 10~ 6 

Estimate 

1.80 x 10" 3 

20.00 x 10“ 5 

0 

Standard error 

0.42 x 10 -3 

14.14 x 10 -5 

NA 

95% confidence interval 

[0.95,2.65] x 10“ 3 

[-8.28,48.28] x 10 -5 

NA 
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Table 5 

Traditional importance sampling scheme 



n = 120 

n = 180 

n = 240 

Theoretical p n 

1.61 x 10“ 3 

9.66 x 10“ 5 

6.35 x 10“ 6 

Estimate p n 

1.40 x 10” 3 

8.76 x 10“ 5 

6.01 x 10 -6 

Standard error 

0.02 x 10~ 3 

0.18 x 10 -5 

0.13 x 10 -6 

95% confidence interval 

[1.35,1.45] x 10~ 3 

[8.41,9.12] x 10" 5 

[5.74,6.28] x 10“ 6 


Remark 4.3. If j3*(x,t) > 1, then its conjugate is a*(x,t ) = +oo, in the 
sense that 

L((3*(x,t)) = sup[a/3*(x,t) — H(a)]= lim [a(3*(x, t) — H{ot)\. 

a a —>+oo 

The corresponding change of measure (at least formally) is 

Q +oo = lim Q a = 

a —>+oo 

Similarly, if /3*(x,t) < 0, then its conjugate is a*(x,t) = —oo, with the cor¬ 
responding change of measure 

Q —oo — lim Qol — 

a —>—OO 

However, neither of these two probability transition kernels is suitable for 
the purpose of importance sampling, since the probability measure induced 
by the original probability transition kernel Q is not absolutely continuous 
with respect to the probability measure induced by Q+oo or Q - 00 . 

To overcome this difficulty, we just take a to be a large positive or neg¬ 
ative number whenever a*(x,t) = +oo or a*(x,t) = — oo. In our numerical 
simulation, a is taken to be 5 if a*(x,t) = +oo and —5 if a*(x,t) = — oo. 
The probability transition kernels corresponding to a = ±5 are 

0.0047 0.9953' 

1 0 


Q +5 


0.9999 

1 


0.0001 

0 


0 1 
1 0 


1 0 
1 0 


Table 6 

Adaptive importance sampling scheme: asymptotic optimality 



n = 120 

n = 180 

n = 240 

Theoretical p„ 

Estimate p„ 

Standard error 

95% confidence interval 
(— log V n )/ (— ldgPn) 

1.61 x 10 -3 

1.56 x 10“ 3 

0.04 x 10 -3 
[1.49,1.63] x 10“ 3 
1.72 

9.66 x 10“ 5 

9.73 x 10“ 5 

0.15 x 10 -5 
[9.44,10.02] x 10 -6 
1.87 

6.35 x 10 -6 
6.29 x 10" 6 
0.07 x 10 -6 
[6.15,6.43] x 10" 6 
1.93 
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Table 7 

Traditional importance sampling scheme 



No. 1 

No. 2 

No. 3 

No. 4 

Estimate p n (xlO -5 ) 

2.14 

2.37 

2.29 

9.20 

Standard error (x 10 5 ) 

0.11 

0.15 

0.14 

6.85 

95% confidence interval (xlO -5 ) 

[1.92, 2.36] 

[2.07, 2.67] 

[2.01, 2.57] 

[-4.50, 22.90] 


which are very close to Q±oo- 

Example 4.2. Consider a two-node tandem Jackson network with ar¬ 
rival rate A and consecutive service rates 51 , 52 - We assume the queueing 
system is stable, that is, A < min{/ri, 52}, and, without loss of generality, 
A + hi + /j .2 = 1. The sizes of the first buffer and the second buffer are de¬ 
noted by B\ and B 2 , respectively. Both buffer sizes are assumed to be finite. 

We will work with the embedded discrete-time Markov chain Y = {Yi = 
(Yd, Y- 2 ) :z = 0,1,...}, representing the queue lengths of the nodes at the 
epochs of transitions in the network. The chain Y is irreducible and with 
finite state space S = {( 2 / 1 , 2 / 2 ) - 2/i = 0,1,..., Bf, i = 1,2}, whence uniformly 
recurrent. It is assumed throughout this example that the initial state is 
T 0 = (0,0). 

We are interested in estimating a class of probabilities associated with 
buffer overflow. More precisely, define 5 = ( 51 , 52 ) —> {0, l } 2 by 

9 l{y) = 1 {j/i=Bi}, 52(5) = l{y 2 =B 2 } 

for every y = ( 51 , 52 ) G S, and let S n = g(Y 0 ) + g{Y \) -f-f- g(Y n - 1 )- We 

wish to estimate p n = P{S n /n E A} for some Borel set A of form 

A = {(x\,x 2 ) :xi>£i or x 2 > £ 2 } C R 2 , 

where 0 < £ 1,62 < 1. Note that the set A is nonconvex. 

The construction of the traditional and adaptive importance sampling 
schemes are very similar to Example 4.1. However, here the function H: R 2 —> 
R and its conjugate L :R 2 —> M + do not admit closed-form expressions, and 
are computed numerically. 


Table 8 

Adaptive importance sampling scheme 



No. 1 

No. 2 

No. 3 

No. 4 

Estimate p n (xl0~ 5 ) 

3.96 

3.93 

4.18 

4.16 

Standard error (xlO -5 ) 

0.17 

0.15 

0.30 

0.16 

95% confidence interval (xl0~ 5 ) 

[3.62, 4.30] 

[3.63, 4.23] 

[3.58, 4.78] 

[3.84, 4.48] 








22 


P. DUPUIS AND H. WANG 


Analogously to Example 4.1, if we let /3* be the minimizer that attains 
inf{L(/3): f3 G A} and let X n denote the traditional importance sampling 
estimate, then we have 

(4.7) lim --log E[(X n ) 2 } = inf [a*/3 — H(a*) + L(J3)\, 

n >oo fi (3dzA 

where a* is the conjugate of j3*. It is not difficult to see that the traditional 
importance sampling scheme is asymptotically optimal if and only if (3* is 
also a minimizer to the right-hand side of (4.7). However, this is often not 
the case, due to the nonconvexity of set A; see [16] for more discussion on 
this issue. 

The simulation results for the traditional and adaptive schemes are re¬ 
ported in Tables 7 and 8. For comparison, the theoretical value of p n is also 
obtained via recursively computing the conditional distribution of g(Y] c ) + 
g(Yk+ 1 ) + • • • + g{Y n - 1 ) given Yy.. for each k = n — l,n — 2,... ,0. Unlike 
Example 4.1, we choose not to report the results from naive Monte Carlo 
simulation (which is not asymptotically optimal). Actually, the naive Monte 
Carlo simulation, often giving an estimate 0 or an estimate with intolera¬ 
bly large standard error, is far inferior to either of the importance sampling 
schemes. 

We choose B\ = E >2 = 6, and A = 0.2, pi = P 2 = 0.4. The state space S 
consists of ( B\ + 1)(E>2 + 1) = 49 states. Set n = 50 and £\ = 0.3, £2 = 0.4. 
Analogously to Example 4.1, one can check that the traditional importance 
sampling is not asymptotically optimal. Indeed, the infimum of L(/3) over 
set A is attained at (3* ~ (0.02,0.4), while the minimizer for the right-hand 
side of (4.7) is /? « (0.3,0.01). 

Each table consists of four simulation runs each with sample size K = 
10,000. The theoretical value is p n = 4.10 x 10 -5 . 

The explanation for the behavior of traditional importance sampling (Ta¬ 
ble 8) is quite similar to that of Example 4.1—most of the sample means will 
end up near point /3*, while a few “rogue” trajectories will have means near 
point (3. Even though these “rogue” trajectories are rare, they carry huge 
Radon-Nikodym derivatives. Without the presence of “rogue” trajectories 
(simulations No. 1, No. 2 and No. 3), we have tight confidence intervals 
that we cannot put much faith in. With the presence of “rogue” trajectories 
(simulations No. 4), we get an estimate with very large standard error. On 
the contrast, the performance of adaptive schemes is much more stable and 
much better. 

Similar phenomenon is also observed for various sets of parameters. We 
just list some numerical results in Tables 9 and 10 for the same setup, except 
the arrival rate and service rates are now (A,^i>/u) = (0.1,0.4,0.5). The 
sample size K = 10,000 is fixed as before. The erratic behavior of traditional 
schemes is more conspicuous. The asymptotic optimality of adaptive schemes 
is also clear from these numerical results. 
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Table 9 

Traditional importance sampling scheme 



n = 50 

n = 80 

n = 110 

Theoretical p n 

5.15 x 10“ 9 

3.47 x 10- 12 

1.83 x 10“ 15 

Estimate p n 

0.83 x 10~ 9 

o.8i x nr 12 

0.53 x 10" 15 

Standard error 

0.03 x 10~ 9 

0.02 x 10- 12 

0.01 x 10 -15 

95% confidence interval 

[0.77,0.89] x 10~ 9 

[0.77,0.85] x 10~ 12 

[0.51,0.55] x 10~ 15 


APPENDIX A 


Proof of Theorem 3.1. Proposition 2.1 and Condition 2.2 imply that 
lim — log P{S n /n £ A} = — inf L(0). 

rwoo n p gA 

Thanks to the discussion in Section 2.3, it suffices to show the lower bound (3.1), 
or 


Urn inf W n > 2 inf L{3). 

n 0&A 


To this end, we extend the dynamics as in Section 3, and consider a mollified 
version of the original control problem. In other words, let F :W l —be 
an arbitrary bounded and Lipschitz continuous function, and define Vp, Wp 
correspondingly; see the discussion from (3.1) to (3.2). 

Since Vp is the value function of a control problem, it satisfies the Bellman 
equation [4] 


Vf( x :VV) 


= inf f e - 2 ( a ’9V))+ 2H { a ) 
JS 


r 2 (y,a) v n 

r 2 (z\a) F 


(x + -g(z),z;i 
V n 



X e (a,g(z))-H(a) . r Jp^L p {y^ dz ) 

r{y,a) 


Table 10 

Adaptive importance sampling scheme: asymptotic optimality 


n = 50 


n = 80 


n = 110 


Theoretical p„ 

5.15 x 10 -9 

3.47 x 10 -12 

1.83 x 10 -15 

Estimate p n 

4.82 x 10“ 9 

3.36 x 10" 12 

1.76 x 10~ 15 

Standard error 

0.18 x 10~ 9 

0.11 x 10 -12 

0.07 x 10' 15 

95% confidence interval 

[4.46,5.18] x 10" 9 

[3.14,3.58] x 10" 12 

[1.62,1.90] x 10 

(-log %")/(-logp„) 

1.86 

1.91 

1.92 
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= inf / e -^9{ z ))+ H( y°) . 


r(y;a) 


aSR d Js r(z 

together with terminal condition 

Vp{x, y, n) = exp{—2nF(x)}. 
It follows from (3.2) that 


^-Vj? (x + -g(y), z;i + l)p(y,dz), 
■a) \ n ) 


(A.l) 


W2{x,yi) = -- log inf [ e -Mz))+H{a) 
n aeR d Js 


X 4^1 e - n ^ (3:+1 / n9( 2 /) ’ 2:i+1) p(y,d2) 

r{z;a) 


and that Wp(x,y;n) = 2 F(x). 

The discussion in Section 3 now prompts the following definition. Fixing 
an arbitrary mgN, for 0 < k < m — 1, define recursively 

1 

m 


Up'(x-, k) = sup inf Up ( x + — ft\ k + 1 


(A.2) 




H- (L(ft) + (a, ft) — H{a )) 

m 


given the terminal condition 

(A.3) Uf{x-,m) = 2F{x) VrG M d . 

See Section 3 for the interpretation of Wp and Up 1 as lower values of games. 
The key observation is the following lemma, whose proof is deferred to Ap¬ 
pendix C. 

Lemma A.l. For an arbitrary sequence x n —> x £ M. d , we have 
liminf inf Wp(x n ,y; \nk/m\) > U"p(x; k), k = 0,l,...,m. 

n—>oo y£S 


Assume Lemma A.l holds for the moment. All that remains to show is 
the inequality 

(A.4) liminf Up’(x; 0) > 2 inf {Lift) + F(x + ft)}. 

m^o o /3eR d 

Indeed, suppose (A.4) is true. Fix an arbitrary j £ N, and define Fj(y) = 
j(d(y,A) A 1), which is bounded and Lipschitz continuous. Since 1 a(u) < 
exp{—2 nFj(y)}, we have 

lim inf W n > lim inf Wp. (0,yo',0) 

> liminf U'p.iO: 0) 

m —>oo ' 
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Exactly as in [14], pages 10 and 11, a compactness argument shows that 
lim inf {L((3) + iL(/3)} = inf ' L(j3), 

j— »oo /3eu d /3e A 

and we complete the proof. 

Now we show inequality (A. 4). The idea is to represent U™ as the value 
function of a control problem with the help of the min/max theorem. To 
this end, define 

C = e V{R d ) : J L(f3)9(d/3) < ooj 
and rewrite (A.2) as 

Up (x;k)= sup inf f U™ fx + — {3; k + 1^) 9(d/3) 

IJ \ m J 

+ HPW(dP) + (a, J139{dp- H(a)^j . 

We make the following useful observation, whose proof is deferred to Ap¬ 
pendix C. 


Lemma A.2. Up , {-]k) is bounded and Lipschitz continuous for every k. 

Indeed, 

\\UF(x;k)\\<2\\F\\ 00 \/x£R d , fc = 0,l,...,m, 

and Up‘(-\ k ) is Lipschitz continuous with Lipschitz constant 2 Lp, where Lp 
is the Lipschitz constant for the mollifier F. 


The next lemma is a version of min/max theorem, whose proof is almost 
identical to [16], Lemma 2.2, and thus omitted. 


Lemma A.3. For any bounded and lower semicontinuous function f:R d 
1, we have 


sup inf 

aeR d sec 


f{(3) d9 + j L{f3) d9+(a,J pdd\-H(a ) 


= inf sup 

0 eC «eK d 


f(P)d9 + J L(/3)d9 + (a,J /3df)\—H{a ) 


Thanks to Lemmas A.2 and A.3, we obtain 


Upfx: k) = inf sup 


j u^(x+^p-,k+iy(dp) 
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(A.5) 


: inf 
eec 


+ ^(/ L(j3)e(dp) + (a, J pern) 

j uF(x + ^p-,k + ije(dp) 

+ ^ L(P)0(dP) + L (/ P8( d P) 


— H(a] 


This last display implies that Up 1 has an interpretation as the minimal 
cost of a stochastic control problem. To simplify the notation, we state 
the control problem only for the case k = 0. The control problem will be 
defined on a probability (Cl,P,P), and E x will denote that the initial con¬ 
dition of the state process is x. An admissible control is a sequence {iA 71 , j = 
0,1,..., m — 1}, with each v™ being a stochastic kernel on given M. d . Given 
an admissible control sequence, the state dynamics are defined by S™ = mx 
and 


where 


QUl • 
^' + 1 “ 


S™ + Y™, 


P{Y™ € dy\Y™,0 <i<j} = P{Y™ € dy\Sf/rn} = uf (dy\Sf / m). 
We then define the value function 





yv?(dy) 


+ 2 F(SZ/m) , 


where the infimum is taken over all controls {iA”} and resulting controlled 
processes {Sj l /m} that start at x at time 0. Since Vp also satisfies the 
DPE (A.5) ([4], Chapter 8) and terminal condition Vp{x\m) = Up , (x\m) = 
2 F(x), we obtain by induction that U™(x;k) = Vp , (x\k) for all x S and 
k £ {0,... ,m}. 

Define a stochastic kernel v m on R ‘ 1 given [0,1] by 


v m (dy\t) 


v™{dy), if t G [j/m, (j + 1 )/m), j = 0,1, ... ,m - 2, 
CiWi [(m-l)/m, 11- 


Let A denote Lebesgue measure. Then the definition of v m {dy\t) and the 
convexity of L imply that 


Uf{x\ 0) = inf E, 




cl 


Jo J R d 

772 — 1 


L(y)u m (dy\t) dt 


+ E f V^(dy))+2F(S^/m) 

~T 0 m \jRd j 
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> inf E x 
V?} 



L(y)u m (dy\t) dt 



+ 2 F(SZ/m) 


= inf E x [ L(y)v m (dy x dt) 
{v™} l_./R d x[0,l] 


+ l( [ yv m {dy x dt)) +2F(SZ/m) , 
\JR d x[0,l] / J 

where v m (dy x dt) = v m (dy\t)dt. A straightforward weak convergence ap¬ 
proach will be adopted to derive the desired inequality (A.4). Since the 
proof is essentially the same as [14], Theorem 5.3.5, we only give a sketch. 

For each e > 0, there exist a sequence of controls {v m ,m E N} such that, 
for every m, we have 


Uf {x-{)) + £> E x 


/R d x[0,l] 


L{y) dv m + L 


/R d x[0,l] ' 


ydu m )+2F(SZ/m) 


Furthermore, since L is nonnegative and F is bounded, we have 


sup E x / L(y)u" n (dy x dt) < oo. 

JR d x[0,l] 

However, since function L is superlinear (Proposition 2.1), it is not difficult 
to check that {v m } is uniformly integrable in the sense that 


lim sup E x f \\y\\v m (dy x dt) = 0. 

C^oo meN 4{y: ||y||>C}x[0,l] 


It follows from that proof of [14], Proposition 5.3.2, that {is 171 } is indeed 
tight. Therefore, we can extract a weakly convergent sub-subsequence, still 
denoted by {^ m }, such that v m => v for some stochastic kernel v whose 
second marginal is Lebesgue measure ([14], Lemma 5.3.4). We utilize the 
Skorokhod representation [6], which allows us to assume (when calculating 
the limits of the integrals) that the convergence is actually w.p.l. It follows 
from the uniform integrability of {v m } and the proof of [14], Proposition 
5.3.5, that 


/R d x[0,l] 


yv m (dy x dt) 


/R d x [0,1] 


yv(dy x dt) 
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Furthermore, it follows from the lower-semicontinuity and nonnegativity of L 
([14], Lemma A.3.12) that, with probability one, 


lim inf 


/R d x[0,l] 


L{y)n rn {dy x dt ) > 


'R d x[0,l] 


L(y)v(dy x dt). 


Thanks to convexity of L and Jensen’s inequality, we have 


i 


R d x[0,l] 


L{y)v(dy x dt) > L 


/R^xp,!] 


yu(dy x dt)). 


By Fatou’s lemma and the lower-semicontinuity of L [28], we have 


lim inf U™ (x; 0) + e > E x 


2 L 


yv{dy x dt) ) + 2 F{Z) 


/R d x[0,l] 

It is now trivial that the right-hand side of the last inequality is bounded 
below by 

2 inf [L((3) + F{x + (3)). 


Since e > 0 is arbitrary, (A.4) follows readily, which completes the proof. □ 


APPENDIX B 

A large deviation upper bound. In this section we study a uniform 
large deviation principle upper bound, which is essential for proving the key 
Lemma A.l. We present a proof based on the weak convergence approach 
[14]. Alternatively, one can adapt the methodology in [13]. 

The following two lemmas will be useful: 


Lemma B.l. Suppose S is a Polish space and V(S) is the space of proba¬ 
bility measures on S endowed with the weak convergence topology. Consider 
a sequence of random variables p n : (D n , T n ,P n ) —>V(S). In other words, 
{p n } is a sequence of random probability measures. Then {yT} is tight if 
and only if the sequence {E n y n } is tight. Here E n y n G V(S) is defined by 

(E n y n )(A) = f y n (u)(A)P n (dco) 

Jn n 

for every Borel set A inV(S). 

Proof. See [23], Theorem 6.1, Chapter 1. □ 

Lemma B.2. Suppose S is a Polish space, {y n } C V(S), and p(-,-) a 
probability transition kernel. If y n —> y in the t- topology for some y£V{S), 
then 
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in the t- topology. Here p®p denotes the probability measure on S x S given 
by 

(p<S>p)(B) = / fi{dx)p{x, dy) 

Jb 

for every Borel set B C5 x S. 


Proof. It suffices to show that 
f (x, y)p n (dx)p(x, dy) 


)SxS 


ISxS 


f(x,y)p(dx)p(x,dy) 


for every bounded, measurable function /. Since p r ‘ 
it remains to show that 


p in the T-topology, 


L 


f(x,y)p(x,dy) 


is a bounded and measurable function (over x). The boundedness is trivial, 
and the measurability follows from Fubini’s theorem; compare [51, Exer¬ 
cise 18.20. □ 


Proposition B.3. Suppose Y = {Yj,j E N} is a Markov chain that 
takes values in a Polish space S. Let p denote the probability transition ker¬ 
nel ofY, and assume Condition 2.1 holds. Suppose g:S —>W l is a bounded 
measurable function, and define H and L as in (2.1)-(2.3). Then for any 
fixed a E W l , bounded and continuous function f :M. d —► R, and sequence 
x n —> x E R d , we have 


lim inf inf-log E v 

n-»oo y£S n 



where 


71—1 


exp< -nfi x n + -Y, 9 (yj) 


3=0 




If{x) = inf [f{x + (3) + L((3) + {a,/3)]. 


Proof. Let 


v n {x,y ) = - -log Ey 


n 


71—1 


1 


71 — 1 


exp j-^a, Y, g{ Y j)j | exp j ~ n f g( Y j ) 


71 — 1 


n r- n 

j=0 

71—1 


= -- {o e J ex Pl -\ a ^ 9 (Vj)) [expj -nf\x + ~Y J g{y. 


dir 


3=0 


3=0 


e a 
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Here 7r” is the joint distribution of (Wo, Y \,..., W n _i), or 

^y(dyo,dyi ,..., dy n ) = 5 y (dy 0 )p(y 0 ,dy 1 )p(yi,dy 2 ) ■ ■ ■ p{y n -i,dy n ). 

Clearly v n is bounded, thanks to the boundedness of g and /. It suffices to 
show that for every sequence x n —► x and {y n } C 5, 


(B.l) liminf v n (x n ,y n ) > If(x). 

For an arbitrary e > 0, the relative entropy representation of exponential 
integrals (3.3) ([14], Proposition 1.4.2) yields the existence of a probability 
measure \i n on 5 ri+1 such that 


(B.2) 


»”(*",»”)+£> \R(p. n IK~) + (a,J I £ 9fe) 


In particular, it is not hard to see that 


(B.3) 


sup — R(p n \\'K™n) < oo. 
neN n 


We can factor p n as in [14], Theorems A.5.4 and A.5.6: 

y n (dy 0 , dyi,...,dy n )= Ho(dyo)vi{dyi\y 0 ) ■ ■ • p™(dy n \y n -i,y n - 2 , ...,yo). 

Now consider a probability space (H,W, P), on which we define a stochastic 
process given by 

P(Y 0 n edy) = ^(dy 0 ), 

P(Y? +1 G dy\Y?, i = 0,1,..., j) = fJ? +1 (dy\Y?, i = 0,1,... ,j) 

for j = 0,1,..., n — 1. To ease exposition, let 

fij+iidy) = rfj+i(dy\Y?,i = 0,1,..., j), 

which is a random probability measure on S. Also define a random proba¬ 
bility measure on S x S by 


n —1 


7 n {dx x dy) = ~Yl dyp{dx) x /Y- +1 (dy), 
whose marginals are 


n n 

j=o 


-j n— 1 n—1 

= = U b”)2 = 

3=0 3=0 
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Thanks to the chain rule [14], Theorem B.2.1, we have 

(B.4) -R^ n \\^ n ) = -E 

n y n 


flKIIV) + EAP”+i(')llp(L.')) 


However, 


n— 1 




n n 

3=0 


n— 1 


= - E R {&Yn{dy) x fJ,j + i(dz)\\5y P (dy) x p(Y?,dz )) 


n — -o 

3=0 
^ n—1 

= - E R ^Y n ( d y) x M"+i(^)ll^yn(^y) ®p(y,dz)) 
n j=0 

/ -, n-i 1 n—1 \ 

> R - E d Y n (dy) X Aj +1 (dz) - E dyn(dy) ®p{y,dz) 

\ n j= o J n i=o J / 

= i?( 7 n ||Z n <g).p), 

where the inequality follows from the convexity of relative entropy T?(- 1 |•)- 
Thanks to (B.2), and observing that 


J-Y.9(ViW = Bj 9 dL', 

3=0 

J f (x n + l E <?(%)) dy n = Ef[x n + Jg dL^j , 


we arrive at 
v n (x n ,y n )+e>E 


R( n n \\L n ®p) + (a, gdL n ) + f (x n + gdL 


It suffices to show { 7 n } is tight. Indeed, if this is true, the same argument as 
in [14], Theorem 8.2.8, allows us to extract a weak convergent subsequence 
of ( 7 n 1 L n ), still indexed by n, such that 

( 7 ^) ^( 7 ,L) 

for some stochastic kernel 7 on S x S and some stochastic kernel L on 5, 
and a (random) transition probability function q such that 

7 (dy x dz) = L(dy) <g> q(y, dz) 
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and 

(B.5) Lq = L 

hold almost surely. In particular, we have 

(7 n )2 => ( 7)2 = Lq = L. 

Note (B.5) says that L is indeed the invariant measure for the transition 
probability function q. Also observe that 

supER{ 7 n \\L n ®p) < 00 . 

ngN 

This implies the existence of a subsequence, still indexed by n, such that 

L n ->L, (7 n h-L 

in the r-topology; see the proof of [14], Lemma 9.3.3. Therefore, 

Jg dL n —> f gdL 

almost surely. Furthermore, thanks to Lemma B.2, L n <S> p ^ L®p in the 
T-topology (hence, in the weak-topology) almost surely. The lower semi¬ 
continuity of T?(-1|-) implies 

lirn inf Rh n \\ L n <g> p) > Rh\\ L <g> p). 

n —>00 

It follows readily from Fatou’s lemma that 
liminfu n (x n ,y n ) +e 

n—>-oo 


> E 


R('y\\L®p) + (a, / gdL) + fix + gdL 


= E 


R{L®q\\L®p) + {ot, j gdl^j + f^x + J gdL 
R{p®q\\p®p) J r(ot,j gdp'j + f(x + J g dp 


> inf 

Recalling (2.4) and letting e —> 0, we obtain 

liminf v n (x n , y n ) > inf [L(/3) + (a, /3) + f(x + /?)], 

which is the desired inequality (B.l). 

It remains to show the tightness of { 7 ”}- All we need is the tightness 
of the two marginals, {( 7 ^) 1 } and {( 7 ^) 2 } - However, it is not difficult to 
observe that 


1 


71 — 1 


n —1 


£(7 n h = EL n = -J2 E5 9n = - ]T 

n “ j n “ 
3 =0 




n,j 


3=0 
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where p n ’ J denotes the jth marginal of the probability y n and, similarly, 

1 n —1 ^ n —1 

E(rh=-zm + 1 —X>- i+1 . 

j=o i=o 

Letting || • || v denote the total variation metric, we have 

(B.6) ||£(7 n )i - E( 7 n ) 2 ||v = -|K'° - ^ n,n !lv < -■ 

n n 

If we can show {( 7 n ) 2 } is tight, then Lemma B.l implies {E( y n ) 2 } is tight, 
which in turns yields the tightness of {E( 7 n )i}, thanks to (B.6). Applying 
Lemma B.l once again, we have the tightness of {(7 n )i}- Therefore, it is 
sufficient to show that {( 7 n ) 2 } is tight. The proof will distinguish two cases: 
mo = 1 and mo > 1. 

Suppose that mo = 1. Note that the nonnegativity of relative entropy, 
(B.3) and (B.4) imply 



It follows from the assumption of uniform recurrency 
av p {-)<p{y,-)<bv p {-) Vy£S, 

that 

^ + i(Ollpft n c))>ci?(^ + ill^) 

for some constant c > 0. It is now easy to derive from the convexity of relative 
entropy that 

2 n —1 

supER(('y n ) 2 \\i'p) < sup E- V R(ftj +1 \\v p ) < oo, 

neN neN n J=0 

which further implies the tightness of {( 7 n ) 2 } since R(-\\i , p) is a tightness 
function on V{S). Note that 

n—1 

E( 7 n ) 2 = -E ^ nJ+1 

is also tight, thanks to Lemma B.l. 

The general case with mo > 1 is slightly more complicated. We will give a 
proof with mo = 2, and observe that the proof for mo > 2 is essentially the 
same and thus omitted. Without loss of generality, we show {( 7 n )2 : n even} 
to be tight. The tightness for {(y n )2 - n odd} is similar. 
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To ease notation, let 7r n A 5 and n n,e be the marginal distribution of 
7T n over even coordinates; that is, 

ir n ’ e (dy 0 ,dy 2 ,. . .,dy n - 2 ,dy n ) = 5 y {dy 0 )p (2) {y 0l dy 2 ) ■ ■ -p (2) (yn-2, dy n ). 

One can similarly define /r n,e , or 

H n ’ e (dy 0 ,dy 2 ,.. .,dy n ) = ^ (dy 0 )^2 ,e (<*/2|yo) ■ ■ ■ aC' e (dy n \y n -2, ■ ■ -,y2,yo)- 

Thanks to the chain rule ([14], Theorem B.2.1) and nonnegativity of the 
relative entropy, we have 


R(p n ’ e \\ir n ’ e ) < R(ii n \\ir n ), 

and, thus, sup n j- L R(y n ’ e \\Tr n ’ e ) < oo. With the same proof as for the case 
mo = 1, we have that 


2 

n 


(n/2) —1 

]T /x n ’ e ’ i+1 


3=0 


is tight; here y 


n,e,J is the jth marginal of // n,e ; that is, 
V n ' e ’ j (dy 2j ) = fi n ’ e (S ,.. • ,S,dy 2 j,S,... ,S). 


One can similarly define /j, n, ° as the marginal distribution of p n over odd co¬ 
ordinates, and the same argument can be carried over to prove the tightness 
of 


2 

n 


(n/2) 1 


^2 n n ’°d +1 

3=0 


However, observe that 

H n ’ e ’i = , p n ’°’i = p n ^ +1 . 


We have 


1 

2 



(n/2)-l 


^2 y. n ’ e ’ j+1 

3=0 


9 ("/ 2)—1 \ 

+ - J 2 ^o,j+l 

3=0 * 


n 


1 n —1 

j2i^ +1 =E(rh. 

3=0 


This implies the tightness of {E('y n ) 2 :n even}, which is equivalent to the 
tightness of {( 7 n )2 -n even}, thanks to Lemma B.l. □ 
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APPENDIX C 

Proofs of Lemmas A.l and A.2. 


Proof of Lemma A.2. That £/]?(■; k) is Lipschitz continuous with Lip- 
schitz constant 2 Lp follows trivially by induction and the terminal condi¬ 
tion (A.3). 

As for the boundedness of U"p(-;k), we first show it is bounded from 
below. Since H{ 0) = 0 and L is nonnegative, definition (A.2) gives 


Up(x]k)> inf 


1 


Up 1 1 x H-/3; k + 1 ) H-L(/3) 


m 


1 


m 


> inf Up (x + — /3; k + 1^) 
/3eR d V m ) 


= inf up(z;k + 1) 
z£R d 

for every x. It follows that, for every k, 


inf U'S'(x;k)> inf U'p(x-,k + 1) > ■ • ■ > inf Up’(x;m)>—2\\F\\ 00 . 
ieM d xm d xeM d 

It remains to show that Up is bounded from above. Let /? be a subdifferential 
of the convex function H at a = 0. Then 


L(f3) = sup [(a, 0) — H (a)] = 0 

a€ R d 

and the supremum is achieved at a = 0. By definition (A.2) again, we have 

UP(x;k)< sup \u™(x+-0,k + l) +-((aJ)-H(a)) 
aeR d l \ m ) m \ 

= Up(x+-0,k + l) 

\ m ) 

< sup Up'(z; k + 1) 
zew d 


for every x. It follows that, for every k, 

sup U™{x\ k) < sup U™(x; k + 1) <■■■ < sup U 7 p{x]m) < 2 \\F 
x£S. d a;GR d oigR^ 

This completes the proof. □ 


Proof of Lemma A.l. The proof is by induction. For k = m, we have 
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[nk/m\ =n. By definition, 


liminf inf Wp(x n ,y:n) = liminf inf 2 F(x n ) = liminf 2F(x n ), 

n—»oo y( zS 1V1 > n _ toQ yeS V ) K n 

and Lemma A.l follows trivially from the continuity of F. 

Assume now the claim holds for k + 1. Let l(n) = \ n{k + l)/mj — [nk/m\. 
Also, let 7 be the probability measure on S^ +1 defined by 

ir J y {dyo,dyi,. ..,dy-j) = S y {dy 0 )p(y 0 ,dy 1 )p{y 1 ,dy 2 ) ■ ■ -piyj-^dyj) 

for every y £ S and every j £ N. 

For an arbritrary a £ R rf , let 


Ua, F {x',k ) = inf 


Up 1 (x-\ -/3; A: + 1^ H-(L(/3) + (ck , /?) — H(a )) 

V m J m 

It follows from the definition that Up 1 (x; k ) = sup Q U™ F (x; k). Therefore, all 
we need to show is that, for every a £ M. d and any sequence x n —► x, 

liminf inf Wp(x n ,y, \ nk/m\) > U™ F (x; k). 

71—>00 y£S ’ 

However, for an arbitrary fixed a £ W 1 ', the dynamic programming prin¬ 
ciple implies that 

W F (x,y; [ nk/m \) 

1 r ( / ^ n ^ \ 1 n ) / . \ 

> --log J ex p|“^ a >Xjff(^)y +£(n)H(a) \ ■ ? 


x exp< 


J =1 r(yy,a) 
-nW F lx + - Y 9(yj),ye(n)\ 

V n j= 0 

[n(k + 1)/r?7,)J j | dn^ 


= -^ lo g J exp|-^a,^5(7/j)^ +£(ra)if(a)j 


r(yo',a) 


r(yi{ny,<x) 

r / ^ £(n)-i 

x expj-ralTj^z + - ^ g{yj),yOn ); 

[n(fc + l)/m)J ] | d-Ky^. 
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Since g is bounded and r(-;a) is both bounded from above and bounded 
away from zero by (2.2), it suffices to show 

(C.l) liminf inf v F (x n ,y, 0) > U™ F (x;k), 

Tl >00 

where 


VF(x,y\ 0) 


^ _ n l0g / eXP { _ ( a ’ ^ 9{Vi)^+t(n) H (°‘)^ 

x expj-nVFp^x + ^ g(yj),y^ n y,[n(k + l)/m\j 


dn$ n \ 


We claim that inequality (C.l) is a direct consequence of 


(C.2) 

where 

VF(x,y, 0 ) 


liminf inf v^(x n , 
n —>00 y£S 


yO)>U™ F (x-k), 



xexpj -nC/p^x + i 9 (Vj)',k + lj j dn e y {n) 


Indeed, since i{n) < n, one can always find a compact set K C M. d such that 


^ t(n)~ 1 

+ - 51 9 (Vj)£K V(j/ 0 ,j/i,...,%( n) ), VneN, 

n i=o 

thanks to the boundedness of <? and the assumption x n —> x. It is also not 
hard to show by contradiction from the induction hypothesis and the con¬ 
tinuity of Up 1 (Lemma A. 2) that, for any e > 0, there exists N(e) G N such 
that for all x G K and n> N(e), 


inf W F (x,y ; |_ n(k + l)/m\) 

y£S 


U F (x-,k+ 1) > —£. 


We arrive at 


liminf inf v F (x n , y; 0) > liminf inf v F (x n , y; 0) — e 

n— KX) y£S n—>OC y£S 


for every e > 0. It follows that (C.l) is implied by (C.2). 
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It remains to show (C.2), which is an easy consequence of the uniform 
large deviation bound Proposition B.3, Lemma A.2, boundedness of g, and 
that 

£(n) 1 

—- ►0 . 

n m 

This completes the proof. □ 

Acknowledgments. We are indebted to Associate Editor and the referee 
for their careful reading of the first version and many helpful suggestions. In 
particular, Example 4.2 is inspired by the comments from Associate Editor. 

REFERENCES 

[1] Asmussen, S. (1985). Conjugate processes and the simulation of ruin probability. 

Stochastic Process. Appl. 20 213-229. MR808158 

[2] Asmussen, S. (1989). Risk theory in a Markovian environment. Scand. Actuar. J. 89 

69-100. MR1035668 

[3] Asmussen, S., Rubinstein, R. and Wang, C. L. (1994). Regenerative rare event 

simulation via likelihood ratios. J. Appl. Probab. 31 797-815. MR1285517 

[4] Bertsekas, D. and Shreve, S. (1978). Stochastic Optimal Control : The Discrete 

Time Case. Academic Press, San Diego, CA. MR511544 

[5] Billingsley, P. (1995). Probability and Measure, 3rd ed. Wiley, New York. 

MR1324786 

[6] Breiman, L. (1968). Probability Theory. Addison-Wesley, Reading, MA. MR229267 

[7] Bucklew, J. A. (1990). Large Deviations Techniques in Decision, Simulation and 

Estimation. Wiley, New York. MR1067716 

[8] Bucklew, J. A. (1998). The blind simulation problem and regenerative processes. 

IEEE Trans. Inform. Theory 44 2877-2891. MR1672152 

[9] Bucklew, J. A., Ney, P. and Sadowsky, J. S. (1990). Monte Carlo simulation and 

large deviations theory for uniformly recurrent Markov chains. J. Appl. Probab. 
27 44-59. MR1039183 

[10] Chen, J., Lu, D., Sadowsky, J. S. and Yao, K. (1993). On importance sampling in 

digital communications. Part I: Fundamentals. 

IEEE J. Selected Areas in Comm. 11 289-307. 

[11] Collamore, J. F. (2002). Importance sampling techniques for the multidimensional 

ruin problem for general Markov additive sequences of random vectors. Ann. 
Appl. Probab. 12 382-421. MR1890070 

[12] Cottrel, M., Fort, J. C. and MALGOUYNES, G. (1983). Large deviations and rare 

events in the study of stochastic algorithms. IEEE Trans. Automat. Control 
AC-28 907-920. MR734079 

[13] de Acosta, A. (1990). Large deviations for empirical measures of Markov chains. 

J. Theoret. Probab. 3 395-431. MR1057523 

[14] Dupuis, P. and Ellis, R. S. (1997). A Weak Convergence Approach to the Theory of 

Large Deviations. Wiley, New York. MR1431744 

[15] Dupuis, P. and Kushner, H. J. (1987). Stochastic systems with small noise, analysis 

and simulation; a phase locked loop example. SIAM J. Appl. Math. 47 643-661. 
MR889644 





DYNAMIC IMPORTANCE SAMPLING 


39 


[16] Dupuis, P. and Wang, H. (2005). Importance sampling, large deviations, and differ¬ 

ential games. Ann. Probab. To appear. MR2115034 

[17] Glasserman, P. (2004). Monte Carlo Methods in Financial Engineering. Springer, 

New York. MR1999614 

[18] Glasserman, P. and Kou, S. (1995). Analysis of an importance sampling estimator 

for tandem queues. ACM Trans. Modeling Comp. Simulation 4 22-42. 

[19] Glasserman, P. and Wang, Y. (1997). Counter examples in importance sampling 

for large deviations probabilities. Ann. Appl. Probab. 7 731-746. MR1459268 

[20] Glynn, P. W. (1995). Large deviations for the infinite server queue in heavy traffic. 

IMA Vol. Math. Appl. 71 387-394. MR1381021 

[21] Heidelberger, P. (1995). Fast simulation of rare events in queueing and reliability 

models. ACM Trans. Modeling Comp. Simulation 4 43-85. 

[22] ISCOE, I., Ney, P. and Nummelin, E. (1985). Large deviations of uniformly recurrent 

Markov additive processes. Adv. in Appl. Math. 6 373-412. MR826590 

[23] Kushner, H. J. (1990). Weak Convergence Methods and Singularly Perturbed 

Stochastic Control and Filtering Problems. Birkhauser, Boston. MR1102242 

[24] Lehtonen, T. and Nyhrinen, H. (1992). Simulating level crossing probabilities by 

importance sampling. Adv. in Appl. Probab. 24 858-874. MR 1188956 

[25] Lehtonen, T. and Nyhrinen, H. (1992). 

On asymptotically efficient simulating of ruin probabilities in a Markovian environ¬ 
ment. Scand. Actuar. J. 1 60-75. MR1193671 

[26] Ney, P. and Nummelin, E. (1987). Markov additive processes I: Eigenvalue properties 

and limit theorems. Ann. Probab. 15 561-592. MR885131 

[27] Ney, P. and Nummelin, E. (1987). Markov additive processes II: Large deviations. 

Ann. Probab. 15 593-609. MR885132 

[28] Rockafellar, R. T. (1970). Convex Analysis. Princeton Univ. Press. MR274683 

[29] Sadowsky, J. S. (1991). Large deviations and efficient simulation of excessive 

backlogs in a GI/G/m queue. IEEE Trans. Automat. Control 36 1383-1394. 
MR1135652 

[30] Sadowsky, J. S. (1993). On the optimality and stability of exponential twisting in 

Monte Carlo estimation. IEEE Trans. Inform. Theory 39 119-128. MR1211494 

[31] Sadowsky, J. S. (1996). On Monte Carlo estimation of large deviations probabilities. 

Ann. Appl. Probab. 6 399-422. MR1398051 

[32] Sadowsky, J. S. and Bucklew, J. A. (1990). On large deviations theory and asymp¬ 

totically efficient Monte Carlo estimation. IEEE Trans. Inform. Theory 36 579- 
588. MR1053850 

[33] Shahsbuddin, P. (1994). Importance sampling for the simulation of highly reliable 

Markovian systems. Management Sci. 40 333-352. 

[34] Siegmund, D. (1976). Importance sampling in the Monte Carlo study of sequential 

tests. Ann. Statist. 4 673-684. MR418369 


Division of Applied Mathematics 
Brown University 
Providence, Rhode Island 02912 
USA 

E-MAIL: dupuis@dam.brown.edu 
e-mail: huiwang@cfm.brown.edu 


